Guided Project: Predicting board game reviews
Posted on Wed 08 July 2015 in Projects
import pandas
board_games = pandas.read_csv("board_games.csv")
board_games = board_games.dropna(axis=0)
board_games = board_games[board_games["users_rated"] > 0]
board_games.head()
%matplotlib inline
import matplotlib.pyplot as plt
plt.hist(board_games["average_rating"])
print(board_games["average_rating"].std())
print(board_games["average_rating"].mean())
Error metric
In this data set, using mean squared error as an error metric makes sense, because the ratings are continuous and follow a roughly normal distribution. Taking the square root of the error puts it in the same units as the ratings, so we can compare it to the standard deviation to see how good the model is at predictions.
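Mean squared error is measured in squared rating units, so comparing it against a standard deviation needs a square root first. The sketch below shows that comparison on made-up numbers; the arrays `actual` and `predicted` are hypothetical and are not taken from board_games.csv.

```python
import numpy

# Hypothetical ratings and predictions, for illustration only.
actual = numpy.array([6.0, 7.5, 5.0, 8.0, 6.5])
predicted = numpy.array([6.5, 7.0, 5.5, 7.0, 6.0])

# Mean squared error is in squared rating units.
mse = numpy.mean((predicted - actual) ** 2)

# The square root puts the error back in rating units.
rmse = numpy.sqrt(mse)

# Baseline: the spread of the ratings themselves
# (ddof=1 matches the sample standard deviation pandas reports).
baseline = numpy.std(actual, ddof=1)

print(mse, rmse, baseline)
```

If the root mean squared error is well below the baseline, the model is doing better than always predicting the mean rating.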
from sklearn.cluster import KMeans
clus = KMeans(n_clusters=5)
cols = list(board_games.columns)
cols.remove("name")
cols.remove("id")
cols.remove("type")
numeric = board_games[cols]
clus.fit(numeric)
import numpy
game_mean = numeric.apply(numpy.mean, axis=1)
game_std = numeric.apply(numpy.std, axis=1)
labels = clus.labels_
plt.scatter(x=game_mean, y=game_std, c=labels)
Game clusters
It looks like most of the games are similar and sit together at low attribute values. As the attribute values increase (such as the number of users who rated a game), there are fewer and fewer games. So most games don't get played much, but a few get a lot of players.
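One way to check how lopsided the clusters are is to count how many rows land in each one. This is a minimal sketch on synthetic points, assuming two well-separated groups; the `points` array and the choice of two clusters are illustrative and do not come from the board game data.

```python
import numpy
from sklearn.cluster import KMeans

# Synthetic stand-in: 90 points near 0 and 10 points near 100 (assumption).
rng = numpy.random.RandomState(0)
small = rng.normal(loc=0.0, scale=1.0, size=(90, 2))
large = rng.normal(loc=100.0, scale=1.0, size=(10, 2))
points = numpy.vstack([small, large])

# Fit k-means with a fixed seed so the result is repeatable.
clus = KMeans(n_clusters=2, n_init=10, random_state=0)
clus.fit(points)

# Count how many points were assigned to each cluster label.
sizes = numpy.bincount(clus.labels_)
print(sizes)
```

Running the same count on the labels from the board game clusters would show whether one cluster holds the bulk of the games.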
correlations = numeric.corr()
correlations["average_rating"]
Correlations
The yearpublished column is surprisingly highly correlated with average_rating, showing that more recent games tend to be rated more highly. Games for older players (minage is high) tend to be more highly rated. The more "weighty" a game is (average_weight is high), the more highly it tends to be rated.
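Sorting the correlations makes the strongest relationships easier to spot. The sketch below does this on a tiny hypothetical table; the column names mirror the ones discussed above, but the values are invented for illustration.

```python
import pandas

# Hypothetical miniature data; the values are made up.
toy = pandas.DataFrame({
    "yearpublished": [1995, 2000, 2005, 2010, 2014],
    "minage": [8, 10, 10, 12, 14],
    "average_weight": [1.5, 2.0, 2.2, 3.0, 3.5],
    "average_rating": [5.5, 6.0, 6.4, 7.1, 7.6],
})

# Correlation of each column with the target, strongest first.
# The target's correlation with itself is always 1, so drop it.
ranked = (
    toy.corr()["average_rating"]
    .drop("average_rating")
    .sort_values(ascending=False)
)
print(ranked)
```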
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
cols.remove("average_rating")
cols.remove("bayes_average_rating")
reg.fit(board_games[cols], board_games["average_rating"])
predictions = reg.predict(board_games[cols])
numpy.mean((predictions - board_games["average_rating"]) ** 2)
Prediction error
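The error is close to the standard deviation of all board game ratings (for a like-for-like comparison, take the square root of the mean squared error so that both values are in rating units). This indicates that our model may not have high predictive power. We'll need to dig more into which games were scored well, and which ones weren't.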
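The error above is computed on the same rows the model was fit on, so it can look better than the model's error on unseen games. A possible next step is to hold out part of the data for evaluation. This is a sketch under stated assumptions: the `features` and `ratings` arrays are synthetic stand-ins, and the 20% split is an arbitrary choice.

```python
import numpy
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature columns and ratings (assumption).
rng = numpy.random.RandomState(1)
features = rng.normal(size=(200, 3))
noise = rng.normal(scale=0.5, size=200)
ratings = 6 + features @ numpy.array([0.5, -0.3, 0.2]) + noise

# Hold out 20% of the rows so the error is measured on unseen data.
train_x, test_x, train_y, test_y = train_test_split(
    features, ratings, test_size=0.2, random_state=1
)

reg = LinearRegression()
reg.fit(train_x, train_y)

# Mean squared error on the held-out rows only.
test_mse = numpy.mean((reg.predict(test_x) - test_y) ** 2)
print(test_mse)
```

With the real data, board_games[cols] and board_games["average_rating"] would take the place of the synthetic arrays.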